Abstract


We discuss the use of the determinantal point process (DPP) as a prior for 
latent structure in biomedical applications, where inference often centers on the 
interpretation of latent features as biologically or clinically meaningful structure. 
Typical examples include mixture models, when the terms of the mixture are meant 
to represent clinically meaningful subpopulations (of patients, genes, etc.). Another
class of examples is feature allocation models. We propose the DPP prior as a
repulsive prior on latent mixture components in the first example, and as a prior
on feature-specific parameters in the second case. We argue that the DPP is in
general an attractive prior model for latent structure when biologically relevant 
interpretation of such structure is desired. We illustrate the advantages of the DPP
prior in three case studies, including inference in mixture models for magnetic
resonance images (MRI) and for protein expression, and a feature allocation model
for gene expression using data from The Cancer Genome Atlas. An important part
of our argument is the availability of efficient and straightforward posterior simulation methods.
We implement a variation of reversible jump Markov chain Monte Carlo simulation 
for inference under the DPP prior, using a density with respect to the unit rate 
Poisson process. 

KEY WORDS: Biomedical; Determinantal point process; Latent structure; Repulsive;
Reversible jump Markov chain Monte Carlo.

1 Introduction


Independent priors for latent structure are almost never appropriate in biomedical
inference. Nevertheless, they are widely used, simply for technical convenience and the lack
of good alternatives. In this paper we argue for an attractive class of such alternative 
models in typical inference problems in biostatistics and bioinformatics. 


We discuss the use of the determinantal point process (DPP) for modeling latent 
biologic structure. In particular, we focus on mixture models and feature allocation 
problems, when the latent components are to be interpreted as biologically meaningful 
structure. For example, in the case of a mixture model, we might want to interpret
components of a mixture as clinically meaningful patient subpopulations. Similarly, when
using feature allocation to model latent tumor cell subpopulations we might want to
interpret the features as substantially distinct subclones (Xu et al., 2015). In both cases, an
important aspect of the problem is the preference for the latent elements being diverse.
Such a preference is poorly formalized by traditionally used independent priors. We suggest
the DPP prior as an attractive alternative to implement repulsive priors. The use of
the DPP for mixture models is not novel. It was originally proposed in Affandi et al.
(2013), but remains curiously under-used in the biomedical literature. The contribution of
the following discussion is the emphasis on problems with small to moderate size
mixtures, the extension to inference for general latent structure, and the detailed posterior
Markov chain Monte Carlo (MCMC) scheme, including easy-to-implement transdimensional
posterior simulation across latent structures of different sizes.


For the moment we restrict attention to parametric mixture models, to be specific 
and also because such models are perhaps the most common models for latent structure 
in biomedical applications. For example, popular Bayesian models for clustering and 
inference on patient subpopulations are variations of the following model. Let $y_i$ denote
a response for the $i$-th patient. We assume

$$y_i \sim \sum_{h=1}^{H} w_h \, p(y_i \mid \mu_h), \qquad (1.1)$$

$i = 1, \ldots, n$, including possibly $H = \infty$. The component-specific sampling model
$p(y_i \mid \mu_h)$ could be, for example, a survival model with parameters $\mu_h$, possibly
including a regression on patient covariates. The use of independent priors for
component-specific parameters $\mu_h$ then gives rise to concerns about over-fitting that
generates redundant mixture components with similar parameters, leading to unnecessarily
complex models and poor interpretability. In particular, such over-fitting compromises the
interpretation of the mixture components as biologically meaningful structure. Rousseau and
Mengersen (2011) argued that such concerns were asymptotically partially mitigated with
carefully chosen priors. Alternatively, Petralia et al. (2012) proposed a class of repulsive
priors for mixture components. The proposed repulsive prior was based on a distance metric in
which small distances were penalized. They showed that using repulsive priors on location
parameters resulted in better separated clusters, while keeping the density estimation
accurate. However, posterior computations are complex and do not readily extend to
high dimensional cases.


An alternative interpretation of (1.1) is as a mixture, $y_i \sim \int p(y_i \mid \mu)\, dG(\mu)$, with
respect to a discrete probability measure $G = \sum_h w_h \delta_{\mu_h}$. If the model is completed
with a Dirichlet process (DP) prior on $G$ the popular DP mixture model is obtained.
See, for example, Ghoshal (2010) for a review of such nonparametric Bayesian models.
Importantly, the DP prior includes independence across the $\mu_h$.


For later reference note that (1.1) can be equivalently written as a hierarchical model
with latent indicators $s_i$,

$$y_i \mid s_i = k \sim p(y_i \mid \mu_k) \quad \text{and} \quad p(s_i = k) = w_k. \qquad (1.2)$$
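
As an illustration of the hierarchical form (1.2), the following minimal sketch first draws the latent indicators and then the responses. It assumes a univariate normal kernel with a common standard deviation for $p(y_i \mid \mu_k)$; the function name `simulate_mixture` and all numerical values are hypothetical choices made for this example only.

```python
import numpy as np

def simulate_mixture(n, weights, mus, sigma=1.0, seed=0):
    """Draw (s, y) from model (1.2) with an assumed N(mu_k, sigma^2) kernel."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    mus = np.asarray(mus, dtype=float)
    # Latent indicators: p(s_i = k) = w_k (labels are 0-based in this sketch).
    s = rng.choice(len(weights), size=n, p=weights)
    # Responses: y_i | s_i = k ~ N(mu_k, sigma^2).
    y = rng.normal(loc=mus[s], scale=sigma)
    return s, y

s, y = simulate_mixture(n=500, weights=[0.5, 0.3, 0.2], mus=[-4.0, 0.0, 5.0])
```

Marginalizing over `s` recovers the mixture form (1.1) with $H = 3$ components.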
Interpreting the latent indicators as cluster membership indicators, model (1.2) includes
inference on a random partition $s = (s_1, \ldots, s_n)$ of $\{1, \ldots, n\}$. Let $S_k = \{i : s_i = k\}$
denote the $k$-th cluster. To avoid the notion of empty clusters, that is $|S_k| = 0$, we
re-arrange the indexing of the $\mu_h$ to start with $h = 1, \ldots, K$ corresponding to non-empty
clusters. Again, an independent prior on the cluster-specific parameters $\mu_h$ complicates
a meaningful interpretation of posterior inference on the random partition $s$.
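
The re-indexing to non-empty clusters can be sketched as follows. The helper name `relabel_nonempty` and the toy indicator vector are assumptions made for illustration; the sketch only shows the bookkeeping and is not part of the posterior simulation scheme.

```python
import numpy as np

def relabel_nonempty(s):
    """Map indicators to labels 1..K over non-empty clusters only."""
    s = np.asarray(s)
    labels, relabeled = np.unique(s, return_inverse=True)
    relabeled = relabeled + 1                      # labels 1..K
    K = len(labels)
    # S_k = {i : s_i = k}, reported with 1-based unit indices.
    clusters = {k: np.flatnonzero(relabeled == k) + 1 for k in range(1, K + 1)}
    return relabeled, clusters

s = [4, 4, 7, 2, 7, 7]        # only components 2, 4 and 7 are occupied
new_s, S = relabel_nonempty(s)
```

After relabeling, $K$ equals the number of occupied components, and every $S_k$ is non-empty by construction.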


In this paper we argue for an alternative model that replaces the independent prior on
$\mu_h$ by the repulsive DPP (Macchi, 1975). Recent reviews of the DPP appear in Lavancier
et al. (2015) and, specifically for finite state spaces, in Kulesza and Taskar (2012). The
use of the DPP as a prior for statistical inference in mixture models, we believe, was first
discussed in Affandi et al. (2013). The main contributions of this paper are the
recognition of the DPP as an attractive prior for latent features in general latent structure
models, including mixture models and latent feature allocation as specific examples; the
discussion of DPP mixtures specifically when one wants to interpret latent structure as
biologically meaningful features; and an easily implemented posterior simulation scheme
for a moderate number of latent structures, as is typical for biomedical inference
problems. Posterior simulation is implemented as a variation of reversible jump (RJ) MCMC
simulation (Green, 1995) for a density with respect to the unit rate Poisson process.


2 Motivating Example 


Magnetic resonance imaging (MRI) is an effective technique for studying the human
brain. For example, MRI volume estimates of white matter (WM), gray matter (GM),
cerebrospinal fluid (CSF), and their spatial distribution help the diagnosis of
degenerative brain illnesses, like Alzheimer’s disease (DeCarli et al., 1992). Therefore, accurate
clustering of MRI data according to tissue types is vital to diagnosis and clinical research.
To illustrate, we download a sample of simulated imaging data from BrainWeb (Cocosco
et al., 1997) for slice number 92. Figure 1 depicts the ground truth components for
CSF, WM and GM. We implement inference under model-based clustering with a DPP
prior and a similar model based on the widely used Dirichlet process mixture (DPM)
model. Model details will be discussed later. For the moment we only intend to highlight
the nature of the inference under the two models to motivate the upcoming discussion.
Figure 1 shows the posterior distribution $p(K \mid \text{data})$ on the number of clusters
estimated under the DPP prior (left panel) and the DPM prior (right panel). As shown in
Figure 1b the DPP clustering model identifies four clusters, three of which match the
simulation truth while the fourth captures simulated noise. In contrast, inference under the
DPM model finds seven clusters, only three of them having a meaningful explanation.


3 Determinantal Point Process (DPP) 


3.1 Definition 


The DPP defines a point process on $S \subset \mathbb{R}^D$, that is, a random point configuration
$X = \{x_1, \ldots, x_K\}$ with $x_k \in S$. We first define it for a finite state space,
$S = \{\omega_1, \ldots, \omega_N\}$. Let $C$ denote an $(N \times N)$ positive semidefinite matrix,
constructed, for example, as $C_{ij} = C(\omega_i, \omega_j)$ with a covariance function
$C(\omega_i, \omega_j)$. Let $C_A$ denote the submatrix of rows and columns indicated by
$A \subseteq S$. In later applications we will identify $x_k$ as $\mu_k$ in
mixture models like (1.2), latent feature allocations etc. For the moment we consider a
generic random point configuration $X$, defined by

$$p(X = A) = \det(C_A) / \det(C + I) \qquad (3.1)$$


as a probability distribution on the $2^N$ possible point configurations $X \subseteq S$. This defines
a subclass of DPPs known as L-ensembles. It is easy to see why (3.1) defines a repulsive
point process if one interprets the determinant as the volume of a parallelotope spanned
by the column vectors of $C_A$. Equal or similar column vectors span less volume than
very diverse ones. Equation (3.1) can be shown to imply the marginal probabilities

$$p(A \subseteq X) \propto \det(M_A) \qquad (3.2)$$

for $M = C(I + C)^{-1}$ (Kulesza and Taskar, 2012), where $M_A$ is a submatrix of $M$.
Equation (3.2) defines a DPP on a finite state space $S$. Every L-ensemble is a DPP,
but not every DPP is an L-ensemble. For singular $(I - M)$ we can define (3.2), but not
(3.1). A good review of DPP models for finite state spaces, including the derivation of
the normalizing constant in (3.1) appears in Kulesza and Taskar (2012).
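
The finite-state definitions (3.1) and (3.2) can be checked by brute-force enumeration on a very small state space. The sketch below assumes a toy state space of four points on a line and a squared exponential covariance; both are illustrative choices and not taken from the applications in this paper. The helper name `l_ensemble_probs` is hypothetical.

```python
import itertools
import numpy as np

def l_ensemble_probs(C):
    """Return {A: p(X = A)} for every subset A of {0, ..., N-1}, following (3.1)."""
    N = C.shape[0]
    norm = np.linalg.det(C + np.eye(N))
    probs = {}
    for r in range(N + 1):
        for A in itertools.combinations(range(N), r):
            idx = list(A)
            # The determinant of the empty submatrix is 1 by convention.
            det_A = np.linalg.det(C[np.ix_(idx, idx)]) if idx else 1.0
            probs[A] = det_A / norm
    return probs

# Assumed toy state space: states 0 and 1 nearly coincide, state 3 is far away.
omega = np.array([0.0, 0.1, 1.0, 2.0])
C = np.exp(-(omega[:, None] - omega[None, :]) ** 2)   # squared exponential covariance
probs = l_ensemble_probs(C)
M = C @ np.linalg.inv(np.eye(4) + C)                  # matrix M of (3.2)

total = sum(probs.values())                # sums to 1 over all 2^N configurations
p_near = probs[(0, 1)]                     # two similar points
p_far = probs[(0, 3)]                      # two well-separated points
incl_01 = sum(p for A, p in probs.items() if 0 in A and 1 in A)
```

In this example the configuration of two nearly identical points receives much less probability than the configuration of two distant points, and the inclusion probability `incl_01` agrees numerically with the determinant of the corresponding submatrix of `M`.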


For a continuous state space $S \subset \mathbb{R}^D$, we define an L-ensemble by a density $f(X)$
with respect to the unit rate Poisson process as

$$f(X) = \det(C_X) \Big/ \prod_{h=1}^{\infty} (\lambda_h + 1) \qquad (3.3)$$

for $X = \{x_1, \ldots, x_K\}$. As before, $C_X$ is a $(K \times K)$ matrix with $(i,j)$ entry defined by a
continuous covariance function $C(x_i, x_j)$. The $\lambda_h$'s are the eigenvalues of the associated
kernel operator $\int_S C(x, y) h(y)\, dy$. Similar to the case of a finite state space, it is possible
to generalize (3.3) to the slightly larger class of DPP models (Lavancier et al. 2015).
However, for the rest of this discussion we will consider L-ensembles and work with the
kernel $C(x_i, x_j)$ only.

For continuous DPP kernels, the eigenvalues $\lambda_h$ are generally unknown except for a
few kernels such as a squared exponential kernel. Several numerical methods are used to
approximate eigenvalues and corresponding eigenfunctions (Lavancier et al. 2015). We
build on Kulesza and Taskar (2010) and decompose the kernel function $C$ as

$$C(x, x') = q(x)\, c(x, x')\, q(x'), \quad x, x' \in X \ \text{and} \ c(x, x) = 1, \qquad (3.4)$$

where $q(x)$ is the quality function and $c(x, x')$ is the similarity kernel. For a multivariate
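$x = (x_1, \ldots, x_D) \in \mathbb{R}^D$ we use a Gaussian-shaped quality function $q(x)$, formed as a
product over the coordinates $d = 1, \ldots, D$ with scale $\sigma_q$, and the squared exponential
similarity kernel

$$c(x, x') = \exp\left\{ -\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\theta^2} \right\},$$

for which Zhu et al. (1998) give analytic results for the eigenvalues and eigenfunctions.
The eigenvalues $\lambda_h$ are available in closed form as a product over $d = 1, \ldots, D$ (3.5),
where $h = (h_1, \ldots, h_D)$ is a multivariate index, $a$ and $b$ are constants determined by
$\sigma_q$ and $\theta$, and $c = \sqrt{a^2 + 2ab}$.
Here, $\theta$ and $\sigma_q$ are hyperparameters that define the kernel function.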


We write $X \sim \mathrm{DPP}(C, \theta, \sigma_q)$ for $X = \{x_1, \ldots, x_K\}$ generated by a DPP model with a
kernel function $C(\cdot, \cdot)$ that is indexed with parameters $\theta, \sigma_q$, and we write $\mathrm{DPP}(C)$ when
$C(\cdot, \cdot)$ involves no unknown hyperparameters.
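
To make the repulsion induced by a quality/similarity kernel concrete, the sketch below evaluates $\log \det(C_X)$, the configuration-dependent part of the log density (3.3), for two point configurations. It assumes a Gaussian-shaped quality function with its normalizing constant omitted and a squared exponential similarity kernel with scale $\theta$; the function names and the values of `theta` and `sigma_q` are hypothetical.

```python
import numpy as np

def dpp_kernel_matrix(X, theta, sigma_q):
    """C_X with entries q(x_i) c(x_i, x_j) q(x_j) for the rows x_i of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    # Assumed quality function: Gaussian shape, normalizing constant omitted.
    q = np.exp(-np.sum(X ** 2, axis=1) / (2.0 * sigma_q ** 2))
    # Similarity kernel with c(x, x) = 1.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    c = np.exp(-sq_dist / theta ** 2)
    return q[:, None] * c * q[None, :]

def log_det_kernel(X, theta, sigma_q):
    """log det(C_X); returns -inf for a singular kernel matrix."""
    sign, logdet = np.linalg.slogdet(dpp_kernel_matrix(X, theta, sigma_q))
    return logdet if sign > 0 else -np.inf

close = [[0.0], [0.05], [1.0]]     # two atoms nearly coincide
spread = [[-1.0], [0.0], [1.0]]    # three well-separated atoms
ld_close = log_det_kernel(close, theta=1.0, sigma_q=10.0)
ld_spread = log_det_kernel(spread, theta=1.0, sigma_q=10.0)
```

Under these assumptions the well-separated configuration has a substantially larger log determinant than the configuration with two nearly coinciding atoms, and a configuration with exactly duplicated atoms has determinant zero.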


3.2 Posterior Simulation 


Later we will use the DPP as a prior probability model for latent structure, including
latent clustering and feature allocation. In both cases, an important step in the posterior
simulation will be a transition probability to change the number of atoms in the DPP. We
discuss a reversible jump (RJ) scheme to implement such transition probabilities using the
density (3.3) with respect to the unit rate Poisson process. Let $\Omega_K$ denote the $\sigma$-algebra
for size $K$ point configurations, and $\Omega = \bigcup_K \Omega_K$. We define an MCMC transition
probability that allows a move from $F_K \in \Omega_K$ to $F_{K+1} \in \Omega_{K+1}$ or $F_{K-1} \in \Omega_{K-1}$. The
algorithm combines the MCMC simulation for a point process from Geyer and Møller
(1994) with the deterministic transformation that is included in the reversible jump (RJ)
scheme of Green (1995). The construction parallels the construction of Green (1995),
with only a minor variation that is needed to reduce the integral with respect to the unit
rate Poisson process to an integral with respect to Lebesgue measure.





















Assume the current state is $x = \{x_1, \ldots, x_K\}$ and we consider two transition
probabilities, $P_u(dy \mid x)$ which proposes a move to a size $K + 1$ point configuration (“up” move)
and $P_d(dx \mid y)$ which proposes a move to a size $K - 1$ point configuration (“down” move).
For example, $P_u$ could be proposing to split one of the atoms in $x$ into two daughters,
thereby incrementing $K$ by one; and $P_d$ could involve merging two points in $x$. Let $q(x)$
denote the probability of choosing $P_u$, and let $A_u(x, y)$ and $A_d(y, x)$ denote the
acceptance probabilities for the up and down proposals, respectively. Finally, let $f(x)$ denote
the density (3.3) with respect to the unit rate Poisson process $\mu(\cdot)$. The detailed balance
condition becomes

$$\int_{F_{K+1}} (1 - q(y)) \left[ \int_{F_K} A_d(y, x)\, P_d(dx \mid y) \right] f(y)\, d\mu(y) = \int_{F_K} q(x) \left[ \int_{F_{K+1}} A_u(x, y)\, P_u(dy \mid x) \right] f(x)\, d\mu(x). \qquad (3.6)$$


Assume that there are $n_{up}(x)$ possible up moves, $j = 1, \ldots, n_{up}(x)$. For example, if
the up move involves splitting one of the atoms of the size $K$ point configuration $x$, we
could choose one of the $n_{up}(x) = K$ points to split. Let $q_{uj}(x)$ denote the probability
of selecting the $j$-th transition probability. That is, $P_u(dy \mid x) = \sum_j q_{uj}(x) P_{uj}(dy \mid x)$.
Similarly, $P_d(dx \mid y) = \sum_j q_{dj}(y) P_{dj}(dx \mid y)$. A sufficient condition for detailed balance
is that equation (3.6) holds for pairs of moves, $P_{uj}, P_{dj'}$, that are defined and linked
in the following sense. We assume that $P_{uj}$ is constructively defined by (i) generating
an auxiliary variable $u \sim q_u(u \mid x)$; (ii) a deterministic, invertible transformation
$y = T(x, u)$; and (iii) the matching down move $P_{dj'}$ is defined by $x = T_1^{-1}(y)$. Here $T_1^{-1}(y)$
denotes the first element of $T^{-1}(y) = (x, u)$. The detailed balance condition becomes

$$\int (1 - q(y))\, q_{dj'}(y) \left[ A_d(y, x)\, I\{x = T_1^{-1}(y) \in F_K;\ y \in F_{K+1}\} \right] f(y)\, d\mu(y) = \int q(x)\, q_{uj}(x) \left[ \int A_u(x, y)\, I\{x \in F_K;\ y = T(x, u) \in F_{K+1}\}\, q_u(u \mid x)\, du \right] f(x)\, d\mu(x)$$


We replaced the range of integration by an indicator for $x \in F_K$ and $y \in F_{K+1}$. Next
we use $\int_{F_K} h(x)\, d\mu(x) = \frac{e^{-|S|}}{K!} \int h(x)\, I(x \in F_K)\, dx_1 \cdots dx_K$. That is, a unit rate Poisson
process restricted to size $K$ point configurations looks exactly like $K$ i.i.d. uniform
random variables on $S$ (Kingman 1992). The extra factor $e^{-|S|} |S|^K / K!$ arises from the
probability of a size $K$ point configuration; the term $|S|^K$ cancels against the uniform
densities $|S|^{-K}$, and $e^{-|S|}$ appears on both sides of the detailed balance condition and
cancels. Note that $x = \{x_1, \ldots, x_K\}$ remains the (unordered) point configuration. We get

$$\int (1 - q(y))\, q_{dj'}(y) \left[ A_d(y, x)\, I\{x \in F_K;\ y \in F_{K+1}\} \right] \frac{f(y)}{(K+1)!}\, dy_1 \cdots dy_{K+1} = \int q(x)\, q_{uj}(x) \left[ \int A_u(x, y)\, I\{x \in F_K;\ y \in F_{K+1}\}\, q_u(u \mid x)\, du \right] \frac{f(x)}{K!}\, dx_1 \cdots dx_K, \qquad (3.7)$$


still using $x = T_1^{-1}(y)$ on the left and $y = T(x, u)$ on the right hand side. Finally,
we use a change of variables, substituting $dy_1 \cdots dy_{K+1}$ by $dx_1 \cdots dx_K\, du\, |J|$ with the
Jacobian $J = \partial T / \partial x_1 \cdots \partial x_K\, \partial u$. A sufficient condition for (3.7) is the equality of the
two integrands, $(1 - q(y))\, q_{dj'}(y)\, A_d(y, x)\, \frac{f(y)}{K+1}\, |J| = q(x)\, q_{uj}(x)\, A_u(x, y)\, q_u(u \mid x)\, f(x)$, for
$x \in F_K$ and $y = T(x, u) \in F_{K+1}$. The condition is verified for $A_u(x, y) = \min\{1, \rho(x, y)\}$
and $A_d(y, x) = \min\{1, 1/\rho(x, y)\}$ with

$$\rho(x, y) = \frac{f(y)}{(K+1)\, f(x)} \cdot \frac{(1 - q(y))\, q_{dj'}(y)}{q(x)\, q_{uj}(x)} \cdot \frac{1}{q_u(u \mid x)} \cdot |J|. \qquad (3.8)$$


Acceptance probability (3.8) defines essentially the RJ algorithm of Green (1995). The
only minor difference is the extra step of representing the probability of a point
configuration with respect to the unit rate Poisson process by a probability of the ordered
$K$-tuple $(x_1, \ldots, x_K)$. Geyer and Møller (1994) use the latter for a birth and death Markov
chain Monte Carlo, and without the deterministic transformation. For posterior simulation
conditional on data $y \sim p(y \mid x, \theta)$, multiply with an additional likelihood ratio in (3.8).
Here $\theta$ are additional parameters in the sampling model, beyond $x$.


In summary, we have shown that the density with respect to the unit rate Poisson
process can be used to construct a RJ MCMC, essentially as if it were a density with respect
to Lebesgue measure. A similar argument holds for Metropolis-Hastings transition probabilities,
without a change in the size of the point configuration x.
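
The acceptance ratio (3.8) can be written as a small function. The sketch below is a direct transcription of (3.8) under stated assumptions rather than a full sampler: the argument names are hypothetical, the density values `f_x` and `f_y` are made-up numbers, and the usage example assumes a birth-type up move on $S = [0, 2]$ with $u$ uniform on $S$ (so $q_u = 1/|S| = 0.5$), $|J| = 1$, and a single up move with a single matching down move.

```python
def rj_acceptance_ratio(f_y, f_x, K, q_x, q_y, q_uj_x, q_dj_y, q_u, jacobian=1.0):
    """Evaluate rho(x, y) in (3.8) for an up move from size K to size K + 1.

    f_x, f_y : density (3.3) at the current and proposed configurations
    q_x, q_y : probability of choosing an up move at x and at y
    q_uj_x   : probability of selecting the j-th up move at x
    q_dj_y   : probability of selecting the matching down move at y
    q_u      : density q_u(u | x) of the auxiliary variable
    """
    return (f_y / ((K + 1) * f_x)
            * ((1.0 - q_y) * q_dj_y) / (q_x * q_uj_x)
            * abs(jacobian) / q_u)

def acceptance_probabilities(rho):
    """A_u = min(1, rho) for the up move, A_d = min(1, 1/rho) for the reverse move."""
    a_up = min(1.0, rho)
    a_down = min(1.0, 1.0 / rho) if rho > 0 else 1.0
    return a_up, a_down

# Hypothetical birth move: u ~ Uniform(S) on S = [0, 2], y = x U {u}, |J| = 1.
rho = rj_acceptance_ratio(f_y=0.3, f_x=0.6, K=2, q_x=0.5, q_y=0.5,
                          q_uj_x=1.0, q_dj_y=1.0, q_u=0.5)
A_u, A_d = acceptance_probabilities(rho)
```

With these illustrative inputs the ratio reduces to $f(y)\,|S| / ((K+1) f(x))$, and the pair `(A_u, A_d)` satisfies the integrand equality that serves as the sufficient condition for (3.7). For posterior simulation, `rho` would additionally be multiplied by the likelihood ratio.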






















4 DPP Clustering 


4.1 Motivation and Model 


Clustering is fundamental to exploratory analysis of bioinformatics data. For instance, 
elucidating patterns of gene expression and identifying sets of genes that behave similarly 
under certain biologic conditions is important in the study of functional genomics and 
proteomics. Clustering also can be applied to develop targeted therapies. We first cluster 
the patient samples into several subgroups based on protein activation (or some other 
patient baseline characteristics), then correlate patient clusters with overall survival and 
investigate subgroup-specific therapies. These and similar applications in biomedical 
inference motivate the following model. 


We start with a mixture of normals sampling model, as it is widely used in clustering
and density estimation. Here we show simulation with the univariate sampling model
(1.2). In Web Appendices A and B we show a straightforward extension to a multivariate
mixture, including a brief simulation study. We assume that data $y^n = \{y_i\}_{i=1}^{n}$ are
generated from $y_i \sim \sum_{k=1}^{K} w_k N(\cdot \mid \mu_k, \sigma_k^2)$, $i = 1, \ldots, n$, with unknown $K$. This is a
special case of (1.1) with random $K$ and with a normal kernel $p(y_i \mid \mu_k)$. The model
implies a prior for a random partition $s = (s_1, \ldots, s_n)$, as in (1.2). Often the inference goal
is to identify latent clusters $S_k = \{i : s_i = k\}$ that correspond to meaningful biologic
conditions or to identify subpopulations that are sufficiently diverse to be considered
for different clinical decisions such as treatment allocation. The protein data analysis
for kidney cancer patients, in Section 4.2, is a typical example. In such problems an
independent prior on the $\mu_k$ has the undesirable feature of allowing for very similar, or
even identical (in the case of a discrete parameter space), $\mu_k$. To interpret different
terms in the mixture as meaningful structure in the population, we prefer a repulsive
prior on the $\mu_k$, that is, a probability model that favors a priori very distinct values
$\mu_k$. The repulsive property and the relative computational simplicity make the DPP an
appealing choice. Kwok and Adams (2012) applied the DPP as a repulsive prior in latent
variable models. However, for lack of efficient computational algorithms their method
was restricted to MAP (maximum a posteriori) inference. Affandi et al. (2013) proposed
a Gibbs sampling technique for inference with DPP priors under fixed $K$ ($K$-DPP). The
posterior simulation scheme from Web Appendix A allows us to implement inference
under an unconstrained DPP prior, including a random size ($K$) point configuration.


The DPP mixture model. We complete the sampling model (1.2) with a DPP prior
on the cluster-specific parameters $\mu_k$:

$$y_i \mid s_i = k \sim p(y_i \mid \mu_k) \quad \text{and} \quad p(s_i = k) = w_k,$$
$$\boldsymbol{\mu} = \{\mu_1, \ldots, \mu_K\} \sim \mathrm{DPP}(C, \theta, \sigma_q), \qquad (4.1)$$

using the kernel function in (3.4). Recall that $\theta, \sigma_q$ are hyperparameters in the definition
of $C$. Finally we use hyperpriors $w \mid K, \delta \sim \mathrm{Dir}(\delta, \ldots, \delta)$, $1/\sigma_k^2 \sim \mathrm{Ga}(a_0, b_0)$,
$\theta \sim N(a_1, b_1^2)$ and $\sigma_q \sim N(a_2, b_2^2)$. Here $\mathrm{Ga}(a, b)$ refers to a Gamma distribution with mean
$a/b$. The model can be easily extended to multivariate responses using multivariate
normal and inverse-Wishart priors. Posterior inference is carried out using MCMC
simulations. Details are shown in Web Appendix A.
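
Because $\mathrm{Ga}(a, b)$ is parameterized to have mean $a/b$, the second argument is a rate, whereas NumPy's gamma sampler expects a scale. The sketch below draws once from the hyperpriors under that convention. It is only an illustration: the function name `draw_hyperparameters` and all default values are hypothetical, the second argument of each normal hyperprior is treated as a standard deviation, and component-specific variances are assumed.

```python
import numpy as np

def draw_hyperparameters(K, delta=1.0, a0=2.0, b0=1.0, a1=1.0, b1=0.5,
                         a2=1.0, b2=0.5, seed=0):
    """One prior draw of (w, component variances, theta, sigma_q)."""
    rng = np.random.default_rng(seed)
    # Weights: w | K, delta ~ Dir(delta, ..., delta).
    w = rng.dirichlet(np.full(K, delta))
    # Ga(a, b) has mean a / b, so b is a rate; NumPy's gamma takes scale = 1 / b.
    precision = rng.gamma(shape=a0, scale=1.0 / b0, size=K)
    # Normal hyperpriors on the kernel hyperparameters (b1, b2 used as std. dev.).
    theta = rng.normal(a1, b1)
    sigma_q = rng.normal(a2, b2)
    return w, 1.0 / precision, theta, sigma_q

w, sigma2, theta, sigma_q = draw_hyperparameters(K=3)
```

Passing `scale=b0` instead of `scale=1.0 / b0` would silently change the prior mean of the precision from $a_0/b_0$ to $a_0 b_0$, which is an easy mistake to make when translating the model into code.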


Two simulation studies. We carry out two simulation studies to evaluate the
performance of the repulsive DPP prior in clustering and density estimation, with both
univariate and multivariate responses. Results are summarized in Figure 2. Details of
the simulation setup and more results are shown in Web Appendix B. See there also for
more discussion of Figure 2 and for a statement of the multivariate version of the DPP
mixture model. The results show that the DPP prior leads to a sparser representation
with interpretable clusters compared to DPM, while maintaining a good fit for the
density estimate, making it a preferable prior model for applications where such parsimony
is desired.


4.2 KIRC Protein Data Analysis 


We implement inference under the proposed DPP mixture model for protein expression
data from Yuan et al. (2014) with $n = 243$ samples and $D = 17$ protein markers for
kidney renal clear cell carcinoma (KIRC). See equation (4) in Web Appendix A.2 for a
statement of the DPP mixture model (4.1) with a multivariate normal kernel $p(y_i \mid \mu_k)$.
Inference goals include correlating protein expression with patients’ overall survival. The
$n = 243$ KIRC samples are classified into three clusters by the proposed DPP mixture
model. As shown in Figure 3, patients stratified by these three DPP groups exhibit
very distinct survival patterns ($p$-value under a log-rank test is $p = 0.00027$). Proteins
that are correlated with better prognosis are relatively elevated in cluster 2 (the best
survival group) while the proteins correlated with worse survival are relatively elevated
in clusters 1 and 3 (especially cluster 3, the worst survival group) (Figure 3). These
results suggest that inference under the DPP prior can successfully classify patients into
biologically meaningful groups based on molecular profiles.


In contrast, under the DPM mixture model, $K = 7$ clusters are identified, estimated
as the mode of $p(K \mid y)$ shown in Figure 3. Most of the seven clusters have small size
($< 20$) while 2/3 of the samples are allocated to one cluster (the red bar in Figure 3).
The clusters are not easily interpreted (Figure 3).

In summary, inference under the DPP prior provides fewer clusters and gives more
interpretable results in molecular profile-based classifications than inference under a
comparable DPM prior.


5 A DPP Feature Allocation Model 


5.1 Motivation and Model 

Breast cancer is a heterogeneous disease in terms of molecular alterations and clinical
responses. Gene expression profiling can provide valuable information for understanding
this complexity and consequently for predicting clinical outcomes. Here we consider a
study reported in Chen et al. (2013) who aim to characterize gene expression profiles by
a small number of underlying distinct molecular drivers. These latent molecular drivers
should be linked to different subsets of samples. This motivates us to propose the model
below which formalizes this preference by using a DPP prior for the pattern of how
molecular drivers (the columns of the matrix $Z$ below) are linked to samples (rows of
$Z$). Let $Y$ denote the observed $n \times S$ data matrix with rows representing samples and
columns representing genes. Let $Z$ be an $n \times K$ binary matrix with $z_{ik} = 1$ if molecular
driver $k$ is present in sample $i$, and 0 otherwise. That is, the $k$-th column $z_k$ defines the
subset $G_k = \{i : z_{ik} = 1\}$ of samples that are linked with the $k$-th molecular driver.

The entire matrix $Z$ defines a multiset $\{G_k, k = 1, \ldots, K\}$. Such multisets are known
as feature allocation (Broderick et al. 2013b) and are popular tools in machine learning
to implement inference about overlapping subsets of experimental units (customers etc.).
The special case of non-overlapping subsets that cover all samples, that is, $G_k \cap G_l = \emptyset$
and $\bigcup_k G_k = \{1, \ldots, n\}$, is a partition. See Broderick et al. (2013a) for a recent review.
We use the feature allocation matrix $Z$ to construct a sampling model for the breast
cancer gene expression data $Y$:

$$Y = Z\beta + E, \qquad (5.1)$$

where $\beta$ is a $K \times S$ loading matrix with each entry $\beta_{kj}$ weighing the contribution of gene
$j$ to the driver $k$ and $E = [\epsilon_{ij}]$ is an error matrix with $\epsilon_{ij} \sim N(0, \sigma^2)$, independently.
This defines a sampling model for the observed gene expressions $Y$ in terms of assumed
latent structure $Z$. That is, $y_{ij} \sim N\left(\sum_{k=1}^{K} z_{ik} \beta_{kj},\ \sigma^2\right)$.
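
The sampling model (5.1) is straightforward to simulate. The sketch below uses toy dimensions and made-up parameter values, all of which are assumptions for illustration only; it also extracts the sample subsets $G_k$ from the columns of $Z$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, S = 6, 2, 4                            # samples, latent drivers, genes (toy sizes)
Z = rng.integers(0, 2, size=(n, K))          # z_ik = 1 if driver k is present in sample i
beta = rng.normal(0.0, 2.0, size=(K, S))     # loadings beta_kj
sigma = 0.1
E = rng.normal(0.0, sigma, size=(n, S))      # eps_ij ~ N(0, sigma^2), independently
Y = Z @ beta + E                             # model (5.1)

# Sample subsets G_k = {i : z_ik = 1}, using 1-based sample labels.
G = [set((np.flatnonzero(Z[:, k]) + 1).tolist()) for k in range(K)]
# Entry-wise mean of y_ij: sum_k z_ik * beta_kj.
mean_Y = np.einsum("ik,kj->ij", Z, beta)
```

Because the subsets $G_k$ may overlap, a sample that exhibits several drivers has a mean expression equal to the sum of the corresponding rows of $\beta$.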

The key assumption in a feature allocation model is the prior model on $Z$. A
technically convenient and traditional prior is the Indian buffet process (IBP) (Ghahramani and
Griffiths, 2006). One of the key properties of the IBP, in the context of this application,
is the implied independence across columns of the binary matrix (re-arranging columns
in left ordered form or by other constraints introduces a trivial form of dependence). This
independence is undesirable for the desired inference on molecular drivers. In particular,
independence across columns implies a positive prior probability for identical columns,
which is meaningless in the interpretation of columns as distinct molecular drivers.

DPP feature allocation. In contrast to the IBP, a DPP prior on the columns $z_k$
formalizes the desired parsimony in identifying latent molecular drivers. We assume

$$\{z_k,\ k = 1, \ldots, K\} \sim \mathrm{DPP}(C) \quad \text{with} \quad C(z_k, z_m) = \exp\left\{ -\sum_{i=1}^{n} \frac{(z_{ik} - z_{im})^2}{\theta^2} \right\}. \qquad (5.2)$$

With large $n$, there is no effective way to decompose the kernel matrix $C$ (an $N \times N$ matrix
with $N = 2^n$) and to compute the eigenvalues and the corresponding eigenvectors. We
therefore fix $\theta$ in (5.2) and complete the model with a conditionally conjugate prior on
the coefficients, $\beta_{kj} \stackrel{iid}{\sim} N(0, \tau^2)$, and hyperpriors $1/\sigma^2 \sim \mathrm{Ga}(a_0, b_0)$, $1/\tau^2 \sim \mathrm{Ga}(a_1, b_1)$.
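
The effect of a squared exponential kernel on binary columns can be seen by evaluating $\det(C_Z)$ for a few small feature allocation matrices. The sketch below is illustrative only: the helper name `column_kernel`, the three toy matrices, and the value `theta = 2.0` are assumptions.

```python
import numpy as np

def column_kernel(Z, theta):
    """Kernel matrix with entries exp(-sum_i (z_ik - z_im)^2 / theta^2)."""
    Z = np.asarray(Z, dtype=float)
    diff = Z[:, :, None] - Z[:, None, :]          # shape (n, K, K)
    return np.exp(-np.sum(diff ** 2, axis=0) / theta ** 2)

Z_distinct = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])    # disjoint sample subsets
Z_similar = np.array([[1, 1], [1, 1], [0, 1], [0, 0]])     # columns differ in one sample
Z_identical = np.array([[1, 1], [1, 1], [0, 0], [0, 0]])   # duplicated column

dets = [np.linalg.det(column_kernel(Z, theta=2.0))
        for Z in (Z_distinct, Z_similar, Z_identical)]
```

The determinant decreases as the columns become more alike and is zero for identical columns, so such a configuration receives no prior mass. This contrasts with independent priors across columns, which assign positive prior probability to identical columns.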

DPP-$K$ feature allocation. In the upcoming applications we find it convenient to
work with a slight variation of model (5.2). Let $\mathrm{DPP}_K(C)$ denote a DPP prior restricted
to a fixed number of atoms, $K$, and define

$$\{z_k,\ k = 1, \ldots, K\} \mid K \sim \mathrm{DPP}_K(C) \quad \text{and} \quad K \sim p(K) \qquad (5.3)$$

for some prior $p(K)$. We refer to the model as the DPP-$K$ feature allocation model.
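
For a very small number of samples the fixed-size DPP over binary columns can be enumerated directly. The sketch below assumes $n = 3$ (so $N = 2^3 = 8$ candidate columns), $K = 2$, and a squared exponential kernel with an arbitrary `theta`; none of these values come from the applications. It also checks a known identity for fixed-size DPPs on finite state spaces: the normalizing constant equals the $K$-th elementary symmetric polynomial of the kernel eigenvalues. Enumeration of this kind is infeasible for realistic $n$.

```python
import itertools
import numpy as np

n, K, theta = 3, 2, 1.5
# All N = 2^n candidate binary columns.
states = np.array(list(itertools.product([0, 1], repeat=n)))
sq_dist = np.sum((states[:, None, :] - states[None, :, :]) ** 2, axis=2)
C = np.exp(-sq_dist / theta ** 2)

# Unnormalized weights det(C_A) over all size-K sets of distinct columns.
subsets = list(itertools.combinations(range(len(states)), K))
weights = np.array([np.linalg.det(C[np.ix_(A, A)]) for A in subsets])
probs = weights / weights.sum()

# K-th elementary symmetric polynomial of the eigenvalues of C.
eigvals = np.linalg.eigvalsh(C)
e_K = sum(np.prod(c) for c in itertools.combinations(eigvals, K))

best = subsets[int(np.argmax(probs))]      # most probable pair of columns
```

In this toy setting the most probable pair consists of two complementary columns, that is, the two columns that differ in every sample.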
The reason for introducing the DPP-$K$ model is that it facilitates a computationally
efficient posterior simulation scheme. Under (5.2) we require a RJ type implementation,
following the general scheme in Section 3.2. However, in some applications it is difficult to
construct proposal distributions that lead to reasonably mixing Markov chains. Instead
we propose in Web Appendix C an alternative MCMC scheme under (5.3), which we
find to work well for applications with feature allocation problems. Posterior inference
in model (5.2) as well as in (5.3) is implemented by MCMC posterior simulation. Details
are shown in Web Appendix C.


Simulation study. We carry out a simulation study to compare inference under the
DPP-$K$ feature allocation prior versus the IBP prior. Results are summarized in
Figure 4. More discussion of the results and details of the simulation setup are in Web
Appendix D. With fewer latent features, inference under the DPP prior can better and
more parsimoniously recover the simulation truth than under a standard IBP prior in
this simulated example.


5.2 Breast Cancer (BRCA) Data Analysis 


We analyze the TCGA BRCA mRNA expression data (The Cancer Genome Atlas
Network et al. 2012). We focus on $n = 150$ tumor samples classified as basal-like,
HER2-enriched (HER2) and luminal A (LumA) subtypes by PAM50, a well-established 50-gene
signature for distinguishing the gene expression-based “intrinsic” subtypes of breast
cancer (Parker et al. 2009). Among those three subtypes, the HER2-enriched subtype is
well studied. There are effective therapeutic drugs developed for targeting HER2 breast
cancer. The basal-like subtype (also known as triple-negative breast cancer due to its
lack of expression of estrogen receptor (ER), progesterone receptor (PR) and HER2),
and the LumA subtype, which is known to have the lowest overall mutation rate, are poorly
understood. As a result, there is currently no effective targeted therapy for these two
subtypes, leaving chemotherapy as the main therapeutic treatment. A better
characterization of basal-like and LumA subtypes at the molecular level is needed for clinical
studies.

We implement inference under the proposed DPP-$K$ latent feature model and identify
$K = 3$ latent features. Figure 5 shows the posterior inferred latent feature matrix $Z$
with samples of the different breast cancer subtypes marked by different colors on the left. The
basal-like, HER2 and LumA samples show clear and distinct patterns: 35 of 50 basal-like
samples exhibit the first and third features and 44 of 50 are depleted with respect to the
second feature; 43 of 50 HER2 samples exhibit the first two features and 33 of 50 are
depleted with respect to the third feature; 48 of 50 LumA samples exhibit the second and
third features and 47 of 50 are depleted with respect to the first feature. More biological
findings are discussed in Web Appendix E.

For comparison, we analyze the same BRCA dataset under a model with an IBP prior.
It identifies 27 latent features, of which 17 are active in fewer than 4 samples. Figure 5
shows the estimated latent feature matrix $Z$. The first three features identified under the
IBP prior can distinguish different breast cancer subtypes: 44 of 50 basal-like samples
exhibit the first feature; 4 of 50 HER2 samples and none of the LumA samples exhibit the
first feature; 48 of 50 LumA samples exhibit the third feature. However, for the remaining
24 features, we cannot observe any pattern across breast cancer subtypes: these
features are sparsely scattered across all samples. This is a good example of how the
independent prior across features, as it is implied in the IBP model, leads to a lack of
parsimony and to latent structure that is difficult to interpret. In summary, the DPP
prior model provides a less complicated representation and more interpretable features
than inference under the IBP prior model.




6 Conclusions 


We argue for the use of repulsive priors in models that involve latent structure as the
main inference target. In many such problems interpretation of the imputed latent
structure favors diverse and parsimonious choices. We specifically discuss examples involving
inference for mixture models and feature allocation models. In these settings, commonly
used models assume independence across latent clusters, features etc., which is
technically convenient, but often inappropriate for the desired inference. We instead propose
the use of DPP models as repulsive priors. The DPP model is attractive mainly because
of the availability of easy-to-implement posterior simulation schemes.

We compare inference using DPP priors with standard Bayesian nonparametric priors 
in the cluster analysis of renal clear cell carcinoma and a feature allocation analysis of 
TCGA BRCA mRNA expression data. Our examples show that using DPP priors leads 
to posterior inference that gains substantially in parsimony and interpretability. Our case 
study results are methodologically corroborated by our analysis of the inferred structures. 
Also, DPP priors lead to a noticeable reduction in model uncertainty and, consequently, 
significantly more efficient estimators of latent structures. 


Beyond mixture models and feature allocation models, inference for latent structures 
arises naturally in many other biomedical applications. One class of such examples comprises 
applications that involve nested clustering, that is, clustering of one set of experimental 
units (e.g., proteins) with respect to shared nested partitions on another set of experimental 
units (e.g., patients). Lee et al. (2013) discussed such applications, but with 
independent priors across distinct nested partitions. In some contexts, the latent structure 
of interest could be a graph, for example, a conditional independence graph that 
might be shared across some subpopulations. For example, Mitra et al. (2015) considered 
dependence structure of histone modifications across different conditions. 


Several important limitations remain. In some applications repulsive priors are inappropriate. 
For example, inference for tumor heterogeneity might involve a prior across 
latent hypothetical subclones. However, following the notion of a phylogenetic tree of 
tumor cell subpopulations, some of these latent subclones should differ by only a few features 
(mutations, copy number variations, etc.). Also, important computational limitations 
may be encountered, depending on the specific application. For example, in big data 
settings, fast posterior approximations developed for standard prior models may not extend 
directly to the case of DPP priors (Xu et al. 2015). Finally, problems related to 
label-switching (Jasra et al., 2005) remain an issue, as in any mixture model. This is 
the case because the DPP prior remains exchangeable, for example across the μ_k in the 
mixture model (4.1). 
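
The exchangeability behind label-switching can be checked numerically: relabeling the component locations permutes the rows and columns of the kernel matrix in the same way, which leaves its determinant unchanged. The sketch below is a hedged illustration only. The squared-exponential kernel, the length scale `theta`, and the example locations are assumptions and are not taken from model (4.1).

```python
import numpy as np


def kernel_matrix(mu, theta=1.0):
    """Assumed squared-exponential kernel evaluated at component locations mu."""
    mu = np.asarray(mu, dtype=float)
    diff = mu[:, None] - mu[None, :]
    return np.exp(-(diff ** 2) / theta ** 2)


mu = np.array([-1.0, 0.5, 2.0])  # hypothetical component locations
relabeled = mu[[2, 0, 1]]        # same locations under a different labeling

det_original = np.linalg.det(kernel_matrix(mu))
det_relabeled = np.linalg.det(kernel_matrix(relabeled))
print(det_original, det_relabeled)
```

Because the two determinants agree, the prior alone cannot distinguish one labeling of the components from another.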


7 Supplementary Materials 

Web appendices and figures referenced in Sections 4.1, 5.1 and 5.2 are available with this 
paper at the Biometrics website on Wiley Online Library. 
[Figure 1 image panels (a)-(d); tissue labels CSF and WM appear in panel (a); panel (c) titles: "Histogram of number of clusters by DPP" and "Histogram of number of clusters by DPM"] 

Figure 1: BrainWeb images. Panel (a) shows the three true tissue types: CSF, WM and 
GM. Panel (b) shows the estimated tissue types under the DPP prior. Panel (c) shows 
the estimated number of tissue types by DPP (left) and DPM (right). Panel (d) shows 
the estimated tissue types under the DPM prior. 
[Figure 2 image panels; column labels: a = 0.5 and a = 1.5; histogram panel title: "Histogram of number of clusters by DPM"] 

Figure 2: Simulation: DPP mixture model. The upper panel shows the histograms of two 
simulated datasets with true density (black), estimated density by DPP prior (red) and 
DPM prior (green). The lower panels present the histograms of the estimated number of 
clusters by DPP prior (2nd row) and DPM prior (3rd row). 
[Figure 3 image panels; panel title: "DPP subtypes based on KIRC protein expression data"] 

Figure 3: KIRC data. Panel (a) shows a Kaplan-Meier plot of overall survival in the 
KIRC core set stratified by the three clusters identified under the DPP prior. Panel (b) shows 
the top differentially expressed protein markers among the three DPP clusters. Columns 
correspond to patients, rows correspond to proteins. Panel (c) is the histogram of the 
number of clusters identified under the DPM prior. Panel (d) shows a heatmap of the seven 
DPM clusters for the top differentially expressed protein markers. The sizes of the seven 
clusters are 163, 19, 39, 14, 6, 1 and 1, respectively. 
[Figure 4 image panels (a)-(h); recovered panel labels: (d) True β°, (e) Estimated β by DPP, (f) Estimated β by IBP; panels (g) and (h) are titled "Histogram of number of latent features by DPP" and "Histogram of number of latent features by IBP", each with x-axis "nfeatures"] 

Figure 4: Simulation: DPP feature allocation model. Panels (a-c) show the true feature 
allocation matrix Z° and the estimate Z under the DPP prior and the IBP prior, respectively. 
Panels (d-f) show the true feature mean β° and the estimated β under the DPP 
prior and under the IBP prior, respectively. Panels (g-h) are histograms of the number 
of latent features identified under the DPP prior and the IBP prior, respectively. 
Figure 5: BRCA data. Estimated feature allocation matrix Z under the DPP prior 
(panel a) and under the IBP prior (panel b). 